Add short term storage expiration indicator to history items #20332
Conversation
I'm anxious about this idea for a few reasons, but Anton is the boss 🤷♀️. If you descend into collections in the history panel, do you get the icon on individual datasets there? The query that summarizes states across the whole collection could accumulate the object store IDs at the same time; it would be a wild query, but it would probably be the easiest and most correct way to summarize the dataset collection. I guess we couldn't get a countdown in that case, but we could add a temp storage icon with more information, like a per-dataset breakdown, shown by clicking on it.
I understand and share your concerns, especially with collections. I also think the other proposed solutions, like sending emails, are even more concerning. It would be really hard to do right without turning it into a massive spam generator, so it's likely not worth it 😅
I would say no. My idea was to do something less accurate, but enough to "inform" the user that the datasets or collections used will be temporary. I thought displaying something at the top level would be enough; if you drill down into that collection, you must have already seen the indication, and we could still display it at the top. I know it is technically possible to mix elements from different object stores in the same collection, but will this be a common case? I was hoping we could assume a single common object store for the HDCA by peeking into just one of its datasets. But yeah, in the worst case, we could do what you suggest: aggregate the object store IDs in the summarize query and, if there is at least one object store ID known to be short-term, display a warning at the top. This would probably already be a huge improvement in raising awareness of the temporary nature of the selected storage without needing many more features.
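(For illustration, the aggregation both comments describe might look roughly like this. This is a sketch only, assuming Galaxy's SQLAlchemy models and a flat collection of HDAs; nested collections would need recursion, and the join path is simplified.)

```python
# Sketch of the aggregation idea: collect the distinct object store IDs of all
# datasets in a collection with a single query. Model names follow Galaxy's
# galaxy.model module; this assumes a flat collection of HDAs.
from sqlalchemy import select

from galaxy.model import Dataset, DatasetCollectionElement, HistoryDatasetAssociation


def collection_object_store_ids(session, dataset_collection_id: int) -> set:
    stmt = (
        select(Dataset.object_store_id)
        .distinct()
        .join(HistoryDatasetAssociation, HistoryDatasetAssociation.dataset_id == Dataset.id)
        .join(DatasetCollectionElement, DatasetCollectionElement.hda_id == HistoryDatasetAssociation.id)
        .where(DatasetCollectionElement.dataset_collection_id == dataset_collection_id)
    )
    return set(session.scalars(stmt))
```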
And we're certain we cannot just take scratch away from people who complain? We "promote" them to a "higher tier" of user where all their data is in permanent storage and advanced options are disabled. Not going to fly, huh?
It is probably uncommon, but collections mixing object stores are pretty easy to create, and my guess is that they would be more common, and have more obvious use cases, than say mixing dbkeys or file extensions; we deal with a mix of those in the UI in a mostly "correct" fashion.
The other option is to not show the indicator at all for collections, and only show it when the user drills down to the dataset level.
I made an attempt to include the set of `object_store_ids`. Let me know if this is still a bad idea 😅
(force-pushed from c5b788c to 86935aa, then from 86935aa to a0fd976)
I've run benchmarks on three dataset collections: 1K, 5K, and 10K datasets. For each collection, I issued 100 requests to the endpoint and recorded the minimum, maximum, and average response times (in milliseconds). Without the `object_store_ids` field:
| Collection | Min (ms) | Max (ms) | Avg (ms) |
| --- | --- | --- | --- |
| 1K | 25.85 | 81.60 | 44.77 |
| 5K | 59.69 | 110.31 | 74.90 |
| 10K | 107.08 | 182.18 | 125.92 |
With the `object_store_ids` field (the changes proposed in d105def):
| Collection | Min (ms) | Max (ms) | Avg (ms) |
| --- | --- | --- | --- |
| 1K | 26.84 | 56.01 | 38.92 |
| 5K | 66.64 | 128.78 | 80.28 |
| 10K | 119.39 | 171.10 | 137.09 |
There is a slight increase in response time for larger collections, but maybe it's worth the tradeoff?
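(For reference, a benchmark along the lines described above can be scripted like this. The endpoint URL is a placeholder, since the exact route isn't shown here.)

```python
# Rough sketch of the benchmark methodology described above: issue 100
# requests against the summary endpoint and report min/max/avg latency.
# The URL is a placeholder; substitute the actual endpoint under test.
import statistics
import time

import requests

URL = "http://localhost:8080/api/..."  # placeholder endpoint
HEADERS = {"x-api-key": "YOUR_GALAXY_API_KEY"}

times_ms = []
for _ in range(100):
    start = time.perf_counter()
    response = requests.get(URL, headers=HEADERS)
    response.raise_for_status()
    times_ms.append((time.perf_counter() - start) * 1000)

print(f"Min: {min(times_ms):.2f} ms")
print(f"Max: {max(times_ms):.2f} ms")
print(f"Avg: {statistics.mean(times_ms):.2f} ms")
```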
On the other hand, I noticed there is still an inaccuracy in this approach. It tracks all the `object_store_ids`, but it assumes the create_time of the whole HDCA is the reference point for calculating expiration, when it should instead consider the oldest create_time among the datasets in each of those object stores.
I will try to explore and benchmark adding the oldest create_time to each `object_store_id` and see what we get...
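(That refinement could extend the same kind of query with a GROUP BY. Again, just a sketch under the same assumptions as the earlier one.)

```python
# Sketch: for each object store used by a collection, find the oldest
# create_time among its datasets, so expiration can be computed per store
# rather than from the HDCA's own create_time. Same simplified join path
# as the earlier sketch, assuming a flat collection of HDAs.
from sqlalchemy import func, select

from galaxy.model import Dataset, DatasetCollectionElement, HistoryDatasetAssociation


def oldest_create_time_per_store(session, dataset_collection_id: int) -> dict:
    stmt = (
        select(Dataset.object_store_id, func.min(Dataset.create_time))
        .join(HistoryDatasetAssociation, HistoryDatasetAssociation.dataset_id == Dataset.id)
        .join(DatasetCollectionElement, DatasetCollectionElement.hda_id == HistoryDatasetAssociation.id)
        .where(DatasetCollectionElement.dataset_collection_id == dataset_collection_id)
        .group_by(Dataset.object_store_id)
    )
    return {store_id: oldest for store_id, oldest in session.execute(stmt)}
```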
(force-pushed from 2213bb5 to 1b8acc2)
The new approach for collections in 1b8acc2 is more accurate, as it takes into account the oldest `create_time` of the datasets associated with each object store used in the collection. Of course, it is slightly slower too, but again, it may be worth the extra time.
[Chart: Average Response Time Comparison (ms)]
This is an optional property that indicates the number of days an object (file) will be stored in short-term storage.
To display the expiration status of datasets stored in a short-term object store.
Reusing the same query used for dbKeys and extensions, we get the unique set of `object_store_ids` where the elements of the collection are stored.
In the case of multiple object stores, we pick the one with the shortest expiration time, as we can assume that as soon as the first element expires, the entire collection should be considered "expired", since we can no longer access all of its elements.
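(As a sketch, that "shortest expiration wins" rule could look like this; helper names are illustrative, not the PR's actual code.)

```python
# Sketch of the rule above: given the oldest create_time per object store and
# each store's object_expires_after_days (None meaning permanent storage),
# the collection is considered expired as soon as its first element expires.
from datetime import datetime, timedelta
from typing import Dict, Optional


def collection_expiration_time(
    oldest_create_time_by_store: Dict[str, datetime],
    expires_after_days_by_store: Dict[str, Optional[int]],
) -> Optional[datetime]:
    candidates = [
        create_time + timedelta(days=days)
        for store_id, create_time in oldest_create_time_by_store.items()
        if (days := expires_after_days_by_store.get(store_id)) is not None
    ]
    return min(candidates) if candidates else None  # None: never expires
```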
And update tests
I don't remember exactly why this was set to optional, but it seems the default of the database field will always be `datetime.now`, so it makes more sense to make the value required.
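(In schema terms, that change amounts to something like the following; an illustrative sketch with a hypothetical model name, not the PR's actual schema.)

```python
# Illustrative sketch: since the database column defaults to datetime.now,
# create_time can be declared required in the response model rather than
# Optional. The model name here is hypothetical.
from datetime import datetime

from pydantic import BaseModel


class DatasetSummary(BaseModel):  # hypothetical model name
    create_time: datetime  # previously Optional[datetime] = None
```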
…ired test: to handle mock datasets when serializing the collection during export.
This provides an accurate expiration date for collections whose datasets have mixed object stores and creation dates.
(force-pushed from 076d536 to 41eaaf6)
xref #20169
This simple approach should not be too expensive and can help the user identify when a dataset might be gone because it is stored in a short-term object store.
This works just by annotating the object store config with a new property, `object_expires_after_days`.

There are still some drawbacks to consider/resolve:

- The value of `object_expires_after_days` is not tied to the actual expiration time of the object store. It seems the cleanup of the object store is handled by external processes, so this value must be kept in sync with the actual expiration time of the object store.
- Collections do not have an `object_store_id` property. I wonder if we could "estimate" or "assume" the object store ID of a collection by looking at the object store ID of its first dataset. This is not ideal, but maybe it could be a good enough workaround? I'm not sure how often collection elements are stored in mixed object stores, but I guess it could happen.
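(For a single dataset, the indicator described above boils down to simple date arithmetic. A minimal sketch, assuming the serializer can see the dataset's create_time and its object store's `object_expires_after_days`.)

```python
# Minimal sketch, not the PR's actual code: a dataset in a short-term object
# store expires object_expires_after_days after its creation time.
from datetime import datetime, timedelta
from typing import Optional


def dataset_expiration_time(
    create_time: datetime,
    object_expires_after_days: Optional[int],
) -> Optional[datetime]:
    if object_expires_after_days is None:
        return None  # permanent storage: no indicator is shown
    return create_time + timedelta(days=object_expires_after_days)
```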